<h1>1. Keypoint Evaluation</h1>
<p>This page describes the <i>keypoint evaluation metrics</i> used by COCO. The evaluation code provided here can be used to obtain results on the publicly available COCO validation set. It computes multiple metrics described below. To obtain results on the COCO test set, for which ground-truth annotations are hidden, generated results must be <a href="#upload">uploaded</a> to the evaluation server. The exact same evaluation code, described below, is used to evaluate results on the test set.</p>

<h2>1.1. Evaluation Overview</h2>
<p>The COCO keypoint task requires simultaneously detecting objects and localizing their keypoints (object locations are not given at test time). As the task of <i>simultaneous detection and keypoint estimation</i> is relatively new, we chose to adopt a novel metric inspired by object detection metrics. For simplicity, we refer to this task as <i>keypoint detection</i> and the prediction algorithm as the <i>keypoint detector</i>. We suggest reviewing the evaluation metrics for <a href="#detection-eval">object detection</a> before proceeding.</p>
<p>The core idea behind evaluating keypoint detection is to mimic the evaluation metrics used for object detection, namely average precision (AP) and average recall (AR) and their variants. At the heart of these metrics is a similarity measure between ground truth objects and predicted objects. In the case of object detection, the IoU serves as this similarity measure (for both boxes and segments). Thresholding the IoU defines matches between the ground truth and predicted objects and allows computing precision-recall curves. To adapt AP/AR to keypoint detection, we only need to define an analogous similarity measure. We do so by defining an <i>object keypoint similarity</i> (OKS) which plays the same role as the IoU.</p>

<h2>1.2. Object Keypoint Similarity</h2>
<p>For each object, ground truth keypoints have the form [x<sub>1</sub>,y<sub>1</sub>,v<sub>1</sub>,...,x<sub>k</sub>,y<sub>k</sub>,v<sub>k</sub>], where x,y are the keypoint locations and v is a visibility flag defined as v=0: not labeled, v=1: labeled but not visible, and v=2: labeled and visible. Each ground truth object also has a scale s which we define as the square root of the object segment area. For details on the ground truth format please see the <a href="#download">download</a> page.</p>
<p>For each object, the keypoint detector must output keypoint locations and an object-level confidence. Predicted keypoints for an object should have the same form as the ground truth: [x<sub>1</sub>,y<sub>1</sub>,v<sub>1</sub>,...,x<sub>k</sub>,y<sub>k</sub>,v<sub>k</sub>]. However, the detector's predicted v<sub>i</sub> are <i>not</i> currently used during evaluation; that is, the keypoint detector is not required to predict per-keypoint visibilities or confidences.</p>
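<p>As a concrete illustration, a flat keypoint annotation can be unpacked into per-keypoint rows. This is a minimal sketch with made-up coordinates and only k=3 keypoints (the COCO person category has k=17):</p>

```python
import numpy as np

# Hypothetical flat annotation for k=3 keypoints, in [x1,y1,v1,...] order.
flat = [231.0, 94.0, 2,   0.0, 0.0, 0,   245.0, 110.0, 1]

kps = np.asarray(flat).reshape(-1, 3)    # one row per keypoint: (x, y, v)
xy = kps[:, :2]                          # keypoint locations
v = kps[:, 2]                            # visibility flags
num_labeled = int((v > 0).sum())         # labeled keypoints (v=1 or v=2)
```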
<p>We define the object keypoint similarity (OKS) as:</p>
<div class="json fontMono">
  <div class="jsonreg">
    <b>OKS = Σ<sub>i</sub>[exp(-d<sub>i</sub><sup>2</sup>/2s<sup>2</sup>&kappa;<sub>i</sub><sup>2</sup>)&delta;(v<sub>i</sub>>0)] / Σ<sub>i</sub>[&delta;(v<sub>i</sub>>0)]</b>
  </div>
</div>
<p>The d<sub>i</sub> are the Euclidean distances between each corresponding ground truth and detected keypoint and the v<sub>i</sub> are the visibility flags of the ground truth (the detector's predicted v<sub>i</sub> are not used). To compute OKS, we pass the d<sub>i</sub> through an unnormalized Gaussian with standard deviation s&kappa;<sub>i</sub>, where s is the object scale and &kappa;<sub>i</sub> is a per-keypoint constant that controls falloff. For each keypoint this yields a keypoint <i>similarity</i> that ranges between 0 and 1. These similarities are averaged over all labeled keypoints (keypoints for which v<sub>i</sub>>0). Predicted keypoints that are not labeled (v<sub>i</sub>=0) do not affect the OKS. Perfect predictions will have OKS=1 and predictions for which all keypoints are off by more than a few standard deviations s&kappa;<sub>i</sub> will have OKS~0. The OKS is analogous to the IoU. Given the OKS, we can compute AP and AR just as the IoU allows us to compute these metrics for box/segment detection.</p>
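<p>The formula above can be sketched directly in NumPy. This is a minimal illustration, not the reference implementation in cocoeval.py; the function name and inputs are placeholders:</p>

```python
import numpy as np

def oks(gt_xy, gt_v, dt_xy, sigmas, area):
    """Object keypoint similarity between one ground-truth object and one
    detection.  gt_xy, dt_xy: (k, 2) keypoint locations; gt_v: (k,)
    ground-truth visibility flags; sigmas: per-keypoint sigma_i;
    area: ground-truth segment area (so s**2 == area)."""
    kappa = 2.0 * np.asarray(sigmas)                  # kappa_i = 2 * sigma_i
    d2 = np.sum((np.asarray(gt_xy) - np.asarray(dt_xy)) ** 2, axis=1)
    ks = np.exp(-d2 / (2.0 * area * kappa ** 2))      # unnormalized Gaussian
    labeled = np.asarray(gt_v) > 0                    # only v_i > 0 keypoints count
    return float(ks[labeled].mean())
```

<p>A perfect detection gives OKS=1, unlabeled keypoints (v<sub>i</sub>=0) have no effect, and shifting any labeled keypoint pushes the score toward 0.</p>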

<h2>1.3. Tuning OKS</h2>
<p>We tune the &kappa;<sub>i</sub> such that the OKS is a perceptually meaningful and easy to interpret similarity measure. First, using 5000 redundantly annotated images in val, for each keypoint type i we measured the per-keypoint standard deviation &sigma;<sub>i</sub> with respect to object scale s. That is, we compute <b>&sigma;<sub>i</sub><sup>2</sup>=E[d<sub>i</sub><sup>2</sup>/s<sup>2</sup>]</b>. &sigma;<sub>i</sub> varies substantially for different keypoints: keypoints on a person's body (shoulders, knees, hips, etc.) tend to have a &sigma; much larger than those on a person's head (eyes, nose, ears).</p>
<p>To obtain a perceptually meaningful and interpretable similarity metric we set <b>&kappa;<sub>i</sub>=2&sigma;<sub>i</sub></b>. With this setting of &kappa;<sub>i</sub>, at one, two, and three standard deviations of d<sub>i</sub>/s the keypoint similarity exp(-d<sub>i</sub><sup>2</sup>/2s<sup>2</sup>&kappa;<sub>i</sub><sup>2</sup>) takes on values of e<sup>-1/8</sup>=.88, e<sup>-4/8</sup>=.61 and e<sup>-9/8</sup>=.32. As expected, human annotated keypoints are normally distributed (ignoring occasional outliers). Thus, recalling the <a href="https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule" target="_blank">68–95–99.7 </a>rule, setting &kappa;<sub>i</sub>=2&sigma;<sub>i</sub> means that 68%, 95%, and 99.7% of human annotated keypoints should have a keypoint similarity of .88, .61, or .32 or higher, respectively (in practice the percentages are 75%, 95% and 98.7%).</p>
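<p>These constants are easy to verify numerically: with &kappa;<sub>i</sub>=2&sigma;<sub>i</sub>, a keypoint error of n standard deviations (d<sub>i</sub>/s = n&sigma;<sub>i</sub>) gives a similarity of exp(-n<sup>2</sup>/8), independent of the particular &sigma;<sub>i</sub>:</p>

```python
import math

# keypoint similarity exp(-d^2 / (2 s^2 kappa^2)) with kappa = 2*sigma,
# evaluated at d/s = n*sigma for n = 1, 2, 3 standard deviations
sims = [math.exp(-n**2 / 8.0) for n in (1, 2, 3)]
print([round(s, 2) for s in sims])   # → [0.88, 0.61, 0.32]
```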
<p>The OKS is the average keypoint similarity across all (labeled) object keypoints. Below we plot the predicted OKS distribution with &kappa;<sub>i</sub>=2&sigma;<sub>i</sub> assuming 10 independent keypoints per object (blue curve) and the actual distribution of human OKS scores on the dually annotated data (green curve):</p>
<p><img src="images/keypoints-oks-person.png" class="wide80" align="center"/></p>
<p>The curves don't match exactly for a few reasons: (1) object keypoints are not independent, (2) the number of labeled keypoints per object varies, and (3) the real data contains 1-2% outliers (most of which are caused by annotators mistaking left for right or annotating the wrong person when two people are nearby). Nevertheless, the behavior is roughly as expected. We conclude with a few observations about human performance: (1) at OKS of .50, human performance is nearly perfect (95%), (2) median human OKS is ~.91, (3) human performance drops rapidly after an OKS of .95. Note that this OKS distribution can be used to predict human AR (as AR doesn't depend on false positives).</p>
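<p>The independence assumption behind the blue curve can be reproduced with a small Monte Carlo simulation. This is a rough sketch using a single illustrative &sigma; for all 10 keypoints, not the per-keypoint values used in the actual plot:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, k, n = 0.05, 10, 100_000   # illustrative sigma_i, 10 keypoints, n samples

# draw annotation errors with E[d^2/s^2] = sigma^2 (dx, dy each N(0, sigma/sqrt(2)))
dx = rng.normal(0.0, sigma / np.sqrt(2), size=(n, k))
dy = rng.normal(0.0, sigma / np.sqrt(2), size=(n, k))
d2 = dx**2 + dy**2                # squared normalized error d^2/s^2

kappa = 2.0 * sigma               # kappa_i = 2 * sigma_i
oks_samples = np.exp(-d2 / (2.0 * kappa**2)).mean(axis=1)   # one OKS per object
median_oks = float(np.median(oks_samples))
```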

<h1>2. Metrics</h1>
<p>The following 10 metrics are used for characterizing the performance of a keypoint detector on COCO:</p>
<div class="json fontMono">
  <div class="jsonreg"><b>Average Precision (AP):</b></div>
  <div class="jsonk">AP</div><div class="jsonv">% AP at OKS=.50:.05:.95 <b>(primary challenge metric)</b></div>
  <div class="jsonk">AP<sup>OKS=.50</sup></div><div class="jsonv">% AP at OKS=.50 (loose metric)</div>
  <div class="jsonk">AP<sup>OKS=.75</sup></div><div class="jsonv">% AP at OKS=.75 (strict metric)</div>
  <div class="jsonreg"><b>AP Across Scales:</b></div>
  <div class="jsonk">AP<sup>medium</sup></div><div class="jsonv">% AP for medium objects: 32<sup>2</sup> &lt; area &lt; 96<sup>2</sup></div>
  <div class="jsonk">AP<sup>large</sup></div><div class="jsonv">% AP for large objects: area &gt; 96<sup>2</sup></div>
  <div class="jsonreg"><b>Average Recall (AR):</b></div>
  <div class="jsonk">AR</div><div class="jsonv">% AR at OKS=.50:.05:.95</div>
  <div class="jsonk">AR<sup>OKS=.50</sup></div><div class="jsonv">% AR at OKS=.50</div>
  <div class="jsonk">AR<sup>OKS=.75</sup></div><div class="jsonv">% AR at OKS=.75</div>
  <div class="jsonreg"><b>AR Across Scales:</b></div>
  <div class="jsonk">AR<sup>medium</sup></div><div class="jsonv">% AR for medium objects: 32<sup>2</sup> &lt; area &lt; 96<sup>2</sup></div>
  <div class="jsonk">AR<sup>large</sup></div><div class="jsonv">% AR for large objects: area &gt; 96<sup>2</sup></div>
</div><br/>
<ol class="fontSmall">
  <li>Unless otherwise specified, AP and AR are averaged over multiple OKS values (.50:.05:.95).</li>
  <li>As discussed, we set &kappa;<sub>i</sub>=2&sigma;<sub>i</sub> for each keypoint type i. For people, the &sigma;'s are .026, .025, .035, .079, .072, .062, .107, .087, &amp; .089 for the nose, eyes, ears, shoulders, elbows, wrists, hips, knees, &amp; ankles, respectively.</li>
  <li>AP (averaged across all 10 OKS thresholds) will determine the challenge winner. This should be considered the single most important metric when considering keypoint performance on COCO.</li>
  <li>All metrics are computed allowing for at most 20 top-scoring detections per image (we use 20 detections, not 100 as in the object detection challenge, as currently person is the only category with keypoints).</li>
  <li>Small objects (segment area &lt; 32<sup>2</sup>) do not contain keypoint annotations.</li>
  <li>For objects without labeled keypoints, including crowds, we use a lenient heuristic that allows matching of detections based on hallucinated keypoints (placed within the ground truth objects so as to maximize OKS). This is very similar to how ignore regions are handled for detection with boxes/segments. See the code for details.</li>
  <li>Each object is given equal importance, regardless of the number of labeled/visible keypoints. We do not filter objects with only a few keypoints, nor do we weight object examples by the number of keypoints present.</li>
</ol>
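<p>For reference, the per-type &sigma; values in note 2 expand to the 17-keypoint COCO person ordering (nose, left/right eye, left/right ear, and so on down to left/right ankle). A sketch of the expanded constants; this mirrors the sigmas hard-coded in pycocotools, though the variable name there may differ across versions:</p>

```python
import numpy as np

# per-type sigmas from note 2, duplicated for left/right keypoints,
# in the standard COCO person keypoint order
sigmas = np.array([
    .026,               # nose
    .025, .025,         # eyes
    .035, .035,         # ears
    .079, .079,         # shoulders
    .072, .072,         # elbows
    .062, .062,         # wrists
    .107, .107,         # hips
    .087, .087,         # knees
    .089, .089,         # ankles
])
kappas = 2.0 * sigmas   # kappa_i = 2 * sigma_i used in the OKS Gaussian
```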

<h1>3. Evaluation Code</h1>
<p>Evaluation code is available on the <a href="https://github.com/cocodataset/cocoapi" target="_blank">COCO github</a>. Specifically, see either <a href="https://github.com/cocodataset/cocoapi/blob/master/MatlabAPI/CocoEval.m" target="_blank">CocoEval.m</a> or <a href="https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py" target="_blank">cocoeval.py</a> in the Matlab or Python code, respectively. Also see <span class="fontMono">evalDemo</span> in either the Matlab or Python code (<a href="https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocoEvalDemo.ipynb" target="_blank">demo</a>). Before running the evaluation code, please prepare your results in the format described on the <a href="#format-results">results format</a> page.</p>
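<p>A typical evaluation run with the Python API looks roughly like the following. This is a sketch: the file paths in the commented call are placeholders, and pycocotools must be installed:</p>

```python
def evaluate_keypoints(ann_file, res_file):
    """Run COCO keypoint evaluation; ann_file is the ground-truth
    annotation JSON, res_file the results JSON in the documented format."""
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO(ann_file)                 # load ground-truth annotations
    coco_dt = coco_gt.loadRes(res_file)      # load detections to evaluate
    coco_eval = COCOeval(coco_gt, coco_dt, 'keypoints')
    coco_eval.evaluate()                     # per-image OKS matching
    coco_eval.accumulate()                   # build precision-recall arrays
    coco_eval.summarize()                    # print the 10 metrics listed above
    return coco_eval.stats

# evaluate_keypoints('person_keypoints_val2017.json', 'my_results.json')
```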

<h1>4. Analysis Code</h1>
<p>In addition to the evaluation code, we also provide a function <span class="fontMono"><a href="https://github.com/matteorr/coco-analyze/blob/release/COCOanalyze_demo.ipynb">analyze()</a></span> for performing a detailed breakdown of the errors in multi-instance keypoint estimation. This is described extensively in the paper <a href="http://www.vision.caltech.edu/~mronchi/projects/PoseErrorDiagnosis/" target="_blank">Benchmarking and Error Diagnosis in Multi-Instance Pose Estimation</a> by Ronchi et al. The <a href="https://github.com/matteorr/coco-analyze">code</a> generates plots like this:</p>
<p><img src="images/keypoints-analysis-CMU.png" class="wide80" align="center"/></p>
<p>We show the results of the analysis of the <a href="http://arxiv.org/abs/1611.08050" target="_blank">Part Affinity Fields</a> detector from Zhe Cao et al., winner of the <a href="#keypoints-2016">2016 Keypoint Challenge</a> at <a href="http://image-net.org/challenges/talks/2016/ECCV2016_workshop_presentation_keypoint.pdf">ECCV 2016</a>.</p>
<p>The plot summarizes the impact of all types of error on the performance of a multi-instance pose estimation algorithm. It is composed of a series of Precision Recall (PR) curves where each curve is guaranteed to be strictly higher than the previous as the algorithm's detections are progressively corrected at an (arbitrary) OKS threshold of .9. The legend shows the Area Under the Curve (AUC). The curves are as follows (see the <a href="http://www.vision.caltech.edu/~mronchi/projects/PoseErrorDiagnosis/">project page</a> for a full description):</p>
<ol class="fontSmall">
  <li><b>Original Dts.</b>: PR obtained with the original detections at OKS=.9 (AP at strict OKS), area under curve corresponds to AP<sup>OKS=.9</sup> metric.</li>
  <li><b>Miss</b>: PR at OKS=.9 (AP at strict OKS), after all miss errors have been corrected. A miss is a large localization error: the detected keypoint is not within the proximity of the correct body part.</li>
  <li><b>Swap</b>: PR at OKS=.9 (AP at strict OKS), after all swap errors have been corrected. A swap is due to confusion between the same body part of different people in an image (e.g. the right elbows of two nearby people).</li>
  <li><b>Inversion</b>: PR at OKS=.9 (AP at strict OKS), after all inversion errors have been corrected. An inversion is due to confusion of body parts within the same person (e.g. the left and right elbow).</li>
  <li><b>Jitter</b>: PR at OKS=.9 (AP at strict OKS), after all jitter errors have been corrected. A jitter is a small localization error: the detected keypoint is within the proximity of the correct body part.</li>
  <li><b>Opt. Score</b>: PR at OKS=.9 (AP at strict OKS), after all the algorithm's detections have been rescored using an oracle function computed at evaluation time. As a result of the rescoring the number of matches between detections and ground-truth instances is maximized.</li>
  <li><b>FP</b>: PR after all background false positives are removed. The resulting PR is a step function that is 1 until max recall is reached, then drops to 0 (the curve is smoother after averaging across categories).</li>
  <li><b>FN</b>: PR after all remaining errors are removed (trivially AP=1).</li>
</ol>
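<p>The AUC values in the legend are areas under interpolated PR curves. A minimal sketch of COCO-style interpolation (take the max precision achieved at or above each recall threshold, then average over 101 evenly spaced thresholds; a simplification of the accumulation logic in cocoeval.py):</p>

```python
import numpy as np

def pr_auc(recall, precision, num_thresholds=101):
    """COCO-style area under a precision-recall curve: for each recall
    threshold t, take the max precision achieved at recall >= t."""
    recall = np.asarray(recall)
    precision = np.asarray(precision)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    interp = [precision[recall >= t].max() if (recall >= t).any() else 0.0
              for t in thresholds]
    return float(np.mean(interp))
```

<p>Because precision is interpolated to its envelope, correcting an error type can only raise the curve, which is why each successive PR curve in the plot sits above the previous one.</p>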
<p>In the case of the above detector, overall AP at OKS=.9 is .327. Correcting all the <i>miss</i> errors results in a large improvement of the AP to .415. Smaller gains are obtained when correcting <i>swaps</i>, .448, and <i>inversions</i>, .545. Another large improvement is obtained when <i>jitter</i> errors are removed, resulting in an AUC of .859. This shows what the performance would be if the CMU algorithm had perfect localization of keypoints. When localization is very good, the impact of <i>confidence score errors</i> is not as significant, but still results in an AUC improvement of about 2% (.879). Optimally scoring detections greatly diminishes the impact of <i>background false positives</i>, as detections rarely remain unmatched. Finally, removing <i>background false negatives</i> yields the remaining AUC needed for perfect performance. In summary, CMU's errors at OKS=.9 are dominated by imperfect localization, mostly jitter errors, and missed detections.</p>
<p>For a given detector, the code generates a total of 180 plots, analyzing all the types of errors at 3 area ranges (medium, large, all) and 10 evaluation thresholds (.50:.05:.95). The analysis code will automatically generate a <a href="https://github.com/matteorr/coco-analyze/blob/release/reports/cmu_performance_report.pdf">pdf report</a> containing a summary of the overall performance, the sensitivity of a method's behavior to the different types of errors and their impact on performance, and several examples of the most significant failure cases.</p>
<p><b>Note:</b> <span class="fontMono">analyze()</span> can take significant time to run, so please be patient. For this reason, we do not run this code on the evaluation server; you must run the code locally using the validation set. You can find the <span class="fontMono">analyze()</span> function as part of this <a href="https://github.com/matteorr/coco-analyze">GitHub repository</a>.</p>
